ESQL: Handle allocation errors inside topn #99931

Merged
merged 6 commits into from
Sep 27, 2023

Conversation

@nik9000 nik9000 commented Sep 26, 2023

This properly handles allocation errors inside of topn by making Block.Builder and Vector.Builder Releasable. The "new way" to deal with block factories is like this:

try (var b = IntBlock.builder(3, blockFactory)) {
  b.append(1);
  b.append(2);
  b.append(3);
  return b.build();
}

If anything goes wrong, the try block calls the builder's close method and all of the circuit breaker bytes the builder reserved are released.

For this all to work well Block.Builders have to be one-shot. In other words, you can only call .build on them one time. That shifts the accounting from the builder into the block. It is an error to call build twice.
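
To make the one-shot contract concrete, here is a minimal, self-contained sketch. `ToyBreaker`, `ToyBlock`, and `ToyBuilder` are hypothetical stand-ins invented for this illustration; they are not the Elasticsearch classes and their accounting is deliberately simplified. The point is only the hand-off: the builder tracks what it reserved, `build` moves that reservation to the block, and `close` releases whatever the builder still owns.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

/** Toy stand-ins for illustration only; these are NOT the Elasticsearch classes. */
final class OneShotBuilderSketch {
    /** Hypothetical byte-counting breaker with a hard limit. */
    static final class ToyBreaker {
        private final long limit;
        private final AtomicLong used = new AtomicLong();

        ToyBreaker(long limit) { this.limit = limit; }

        void add(long bytes) {
            if (used.addAndGet(bytes) > limit) {
                used.addAndGet(-bytes); // undo the failed reservation
                throw new IllegalStateException("toy breaker tripped");
            }
        }

        void release(long bytes) { used.addAndGet(-bytes); }

        long used() { return used.get(); }
    }

    /** The built block owns the reserved bytes until it is closed. */
    static final class ToyBlock implements AutoCloseable {
        final int[] values;
        private final ToyBreaker breaker;
        private final long bytes;

        ToyBlock(int[] values, ToyBreaker breaker, long bytes) {
            this.values = values;
            this.breaker = breaker;
            this.bytes = bytes;
        }

        @Override
        public void close() { breaker.release(bytes); }
    }

    /** One-shot builder: build() may be called once, close() releases what is left. */
    static final class ToyBuilder implements AutoCloseable {
        private final ToyBreaker breaker;
        private int[] values;
        private int count;
        private long reserved;
        private boolean built;

        ToyBuilder(int estimatedSize, ToyBreaker breaker) {
            this.breaker = breaker;
            int initial = Math.max(estimatedSize, 2);
            reserve((long) initial * Integer.BYTES);
            values = new int[initial];
        }

        private void reserve(long bytes) {
            breaker.add(bytes); // throws before we record anything if the breaker trips
            reserved += bytes;
        }

        ToyBuilder append(int v) {
            if (count == values.length) {
                // Doubling the array: reserve the extra half before allocating it.
                reserve((long) values.length * Integer.BYTES);
                values = Arrays.copyOf(values, values.length * 2);
            }
            values[count++] = v;
            return this;
        }

        ToyBlock build() {
            if (built) {
                throw new IllegalStateException("already built");
            }
            built = true;
            ToyBlock block = new ToyBlock(Arrays.copyOf(values, count), breaker, reserved);
            reserved = 0; // the accounting now belongs to the block
            return block;
        }

        @Override
        public void close() {
            breaker.release(reserved); // releases nothing after a successful build
            reserved = 0;
        }
    }

    static ToyBlock buildThree(ToyBreaker breaker) {
        try (ToyBuilder b = new ToyBuilder(3, breaker)) {
            return b.append(1).append(2).append(3).build();
        }
    }
}
```

In this sketch a breaker failure in the middle of `append` leaves the breaker at zero once the `try` block exits, and a successful `build` leaves the bytes reserved until the returned block is closed.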

@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Sep 26, 2023
@@ -27,31 +29,32 @@ public class CrankyCircuitBreakerService extends CircuitBreakerService {
public static final String ERROR_MESSAGE = "cranky breaker";

private final CircuitBreaker breaker = new CircuitBreaker() {
@Override
public void circuitBreak(String fieldName, long bytesNeeded) {
private final AtomicLong used = new AtomicLong();
Member Author

I'm modifying this so I can assert that we release all memory after we break.
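
As a rough illustration of why tracking used bytes helps, here is a self-contained sketch of a breaker that fails on demand and counts what is reserved. `CrankyBreakerSketch`, its method names, and the injected `shouldFail` supplier are assumptions made for this example; the real `CrankyCircuitBreakerService` is not reproduced here.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; not the real CrankyCircuitBreakerService. */
final class CrankyBreakerSketch {
    static final String ERROR_MESSAGE = "cranky breaker";

    private final AtomicLong used = new AtomicLong();
    private final BooleanSupplier shouldFail;

    /** shouldFail decides, per reservation, whether the breaker "trips". */
    CrankyBreakerSketch(BooleanSupplier shouldFail) {
        this.shouldFail = shouldFail;
    }

    void reserve(long bytes) {
        if (shouldFail.getAsBoolean()) {
            throw new IllegalStateException(ERROR_MESSAGE);
        }
        used.addAndGet(bytes);
    }

    void release(long bytes) {
        long now = used.addAndGet(-bytes);
        if (now < 0) {
            throw new AssertionError("Used bytes: [" + now + "] must be >= 0");
        }
    }

    long getUsed() {
        return used.get();
    }

    /**
     * Reserves up to {@code steps} chunks of 8 bytes and always releases what
     * was actually reserved. Returns true if a reservation failed on the way.
     */
    static boolean workloadBroke(CrankyBreakerSketch breaker, int steps) {
        long reserved = 0;
        try {
            for (int i = 0; i < steps; i++) {
                breaker.reserve(8);
                reserved += 8;
            }
            return false;
        } catch (IllegalStateException e) {
            return true;
        } finally {
            breaker.release(reserved);
        }
    }
}
```

A test can then run a workload, let it fail wherever the supplier says so, and assert that `getUsed()` is back to zero afterwards. Passing something like `() -> ThreadLocalRandom.current().nextInt(20) == 0` would make the failures random; the rate there is arbitrary.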

@@ -20,7 +22,7 @@ final class BooleanBlockBuilder extends AbstractBlockBuilder implements BooleanB
BooleanBlockBuilder(int estimatedSize, BlockFactory blockFactory) {
super(blockFactory);
int initialSize = Math.max(estimatedSize, 2);
- adjustBreaker(initialSize);
+ adjustBreaker(RamUsageEstimator.NUM_BYTES_ARRAY_HEADER + initialSize * elementSize());
Member Author

These were pretty far off so I took the liberty of making them a bit more accurate.

Contributor

Good catch. Thanks.
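
To see how far off the old number could be, here is a small worked sketch of the arithmetic. The 16-byte array header is an assumption (a typical value on 64-bit HotSpot with compressed object pointers); the real code takes it from `RamUsageEstimator.NUM_BYTES_ARRAY_HEADER`. `BuilderEstimateSketch` is a hypothetical name.

```java
/** Hypothetical sketch of the initial reservation arithmetic. */
final class BuilderEstimateSketch {
    /** ASSUMPTION: stands in for RamUsageEstimator.NUM_BYTES_ARRAY_HEADER. */
    static final long ASSUMED_ARRAY_HEADER_BYTES = 16;

    /** Old accounting: the element count was reserved as if it were a byte count. */
    static long oldReservation(int estimatedSize) {
        return Math.max(estimatedSize, 2);
    }

    /** New accounting: array header plus one slot per element. */
    static long newReservation(int estimatedSize, int elementSize) {
        int initialSize = Math.max(estimatedSize, 2);
        return ASSUMED_ARRAY_HEADER_BYTES + (long) initialSize * elementSize;
    }
}
```

Under that assumption, a builder for 10 `long` values reserves 16 + 10 * 8 = 96 bytes instead of 10.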

if (elementType == ElementType.UNKNOWN || elementType == ElementType.NULL || elementType == ElementType.DOC) {
continue;
}
params.add(new Object[] { elementType });
Member Author

I moved this to parameterized tests so it'll pick up new elementTypes by default.

public static BlockFactory blockFactory(ByteSizeValue size) {
BigArrays bigArrays = new MockBigArrays(PageCacheRecycler.NON_RECYCLING_INSTANCE, size);
return new BlockFactory(bigArrays.breakerService().getBreaker(CircuitBreaker.REQUEST), bigArrays);
}
Member Author

This just felt like a nice place to stick this so I could share it.

Contributor

++

- BytesRefBlock block1 = builder.build();
- BytesRefBlock block2 = builder.build();
+ BytesRefBlock.Builder builder1 = BytesRefBlock.newBlockBuilder(grow ? 0 : positions);
+ BytesRefBlock.Builder builder2 = BytesRefBlock.newBlockBuilder(grow ? 0 : positions);
Member Author

These happened because block builders are one-shot now.

Contributor

This is actually more readable. 👍

public void close() {
while (page.hasNext()) {
page.next().releaseBlocks();
}
Member Author

This lets me assert that we've closed all the input pages even if there's an error!
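
The drain-on-close idea can be sketched on its own. `ToyPage` and `ToySource` below are hypothetical stand-ins, not the test classes from this PR; the sketch only shows that releasing the unconsumed pages in `close` means every page ends up released whether or not the consumer failed partway through.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of a page source that releases leftovers on close. */
final class DrainOnCloseSketch {
    static final class ToyPage {
        private boolean released;

        void releaseBlocks() { released = true; }

        boolean isReleased() { return released; }
    }

    static final class ToySource implements AutoCloseable {
        private final Iterator<ToyPage> pages;

        ToySource(List<ToyPage> pages) { this.pages = pages.iterator(); }

        boolean hasNext() { return pages.hasNext(); }

        ToyPage next() { return pages.next(); }

        @Override
        public void close() {
            // Release every page nobody consumed.
            while (pages.hasNext()) {
                pages.next().releaseBlocks();
            }
        }
    }

    /** Consumes pages, failing after failAfter of them, then reports whether all were released. */
    static boolean allReleasedAfterFailure(int pageCount, int failAfter) {
        List<ToyPage> all = new ArrayList<>();
        for (int i = 0; i < pageCount; i++) {
            all.add(new ToyPage());
        }
        try (ToySource source = new ToySource(all)) {
            for (int consumed = 0; source.hasNext(); consumed++) {
                if (consumed == failAfter) {
                    throw new IllegalStateException("simulated failure");
                }
                source.next().releaseBlocks(); // the consumer releases what it takes
            }
        } catch (IllegalStateException e) {
            // Expected in this sketch; close() has already drained the rest.
        }
        return all.stream().allMatch(ToyPage::isReleased);
    }
}
```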

@elasticsearchmachine elasticsearchmachine added Team:QL (Deprecated) Meta label for query languages team and removed needs:triage Requires assignment of a team area label labels Sep 26, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-ql (Team:QL)

@elasticsearchmachine
Collaborator

Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL)

@nik9000
Member Author

nik9000 commented Sep 26, 2023

run elasticsearch-ci/part-2

@nik9000
Member Author

nik9000 commented Sep 26, 2023

» [2023-09-26T21:41:09,555][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [javaRestTest-1] fatal error in thread [elasticsearch[javaRestTest-1][esql_worker][T#24]], exiting java.lang.AssertionError: Used bytes: [-24] must be >= 0

There's some kind of double release going on, I think it's on blocks but I'm not sure.
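
For readers unfamiliar with this failure mode, the sketch below shows how releasing the same bytes twice drives a usage counter negative, and how a guard flag on the owning resource prevents it. It illustrates the symptom only and makes no claim about the actual cause in this PR. `Counter` and `Reservation` are hypothetical names.

```java
/** Hypothetical sketch of a double release; not the code from this PR. */
final class DoubleReleaseSketch {
    static final class Counter {
        private long used;

        void adjust(long delta) {
            used += delta;
            if (used < 0) {
                throw new AssertionError("Used bytes: [" + used + "] must be >= 0");
            }
        }

        long used() { return used; }
    }

    /** Owns a reservation and releases it at most once. */
    static final class Reservation implements AutoCloseable {
        private final Counter counter;
        private final long bytes;
        private boolean released;

        Reservation(Counter counter, long bytes) {
            this.counter = counter;
            this.bytes = bytes;
            counter.adjust(bytes);
        }

        @Override
        public void close() {
            if (released) {
                return; // a second close is ignored
            }
            released = true;
            counter.adjust(-bytes);
        }
    }

    /** Releases 24 bytes twice without a guard and returns the resulting error message. */
    static String unguardedDoubleRelease() {
        Counter counter = new Counter();
        counter.adjust(24);
        counter.adjust(-24);
        try {
            counter.adjust(-24);
            return "no error";
        } catch (AssertionError e) {
            return e.getMessage();
        }
    }
}
```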

@nik9000
Member Author

nik9000 commented Sep 27, 2023

CrankyBreaker is my hero!

    @Repeat(iterations=1000)
    public void testCranky() {
        BigArrays bigArrays = new MockBigArrays(PageCacheRecycler.NON_RECYCLING_INSTANCE, new CrankyCircuitBreakerService());
        BlockFactory blockFactory = new BlockFactory(bigArrays.breakerService().getBreaker(CircuitBreaker.REQUEST), bigArrays);
        try {
            try (Block.Builder builder = elementType.newBlockBuilder(10, blockFactory)) {
                BasicBlockTests.RandomBlock random = BasicBlockTests.randomBlock(elementType, 10, false, 1, 1, 0, 0);
                builder.copyFrom(random.block(), 0, random.block().getPositionCount());
                try (Block built = builder.build()) {
                    assertThat(built, equalTo(random.block()));
                }
            }
            // If we made it this far cranky didn't fail us!
        } catch (CircuitBreakingException e) {
            assertThat(e.getMessage(), equalTo(CrankyCircuitBreakerService.ERROR_MESSAGE));
        }
        assertThat(blockFactory.breaker().getUsed(), equalTo(0L));
    }

@nik9000
Member Author

nik9000 commented Sep 27, 2023

OK! The problem is with the BytesRefBuilder. We "estimate" the bytes used, but they aren't really an estimate at all.

@nik9000
Member Author

nik9000 commented Sep 27, 2023

I got it! Fix incoming.

@nik9000
Member Author

nik9000 commented Sep 27, 2023

Almost there! BytesRefs weren't building in the same way as everything else but they sure tried to. It wasn't that big a deal that we were off before, but this should catch it! Lots more tests incoming too.

@nik9000
Member Author

nik9000 commented Sep 27, 2023

OK! That should work.

Contributor

@ChrisHegarty ChrisHegarty left a comment

LGTM


@Override
public String toString() {
return "1gb";
Contributor

Thank you. The toString is super helpful here. 👍

Member Author

Yeah! I got the old error message from gradle and thought "I have no idea what this means". Easy enough fix!

* Memory used by the {@link BigArrays} portion of this {@link BytesRefArray}.
*/
public long bigArraysRamBytesUsed() {
return startOffsets.ramBytesUsed() + bytes.ramBytesUsed();
Contributor

++

@@ -75,6 +75,7 @@ public BooleanBlock expand() {
public static long ramBytesEstimated(boolean[] values, int[] firstValueIndexes, BitSet nullsMask) {
return BASE_RAM_BYTES_USED + RamUsageEstimator.sizeOf(values) + BlockRamUsageEstimator.sizeOf(firstValueIndexes)
+ BlockRamUsageEstimator.sizeOfBitSet(nullsMask) + RamUsageEstimator.shallowSizeOfInstance(MvOrdering.class);
// TODO mvordering is shared
Contributor

Ah yes, of course. 👍

UNKNOWN((estimatedSize, blockFactory) -> { throw new UnsupportedOperationException("can't build null blocks"); });

interface BuilderSupplier {
Block.Builder newBlockBuilder(int estimatedSize, BlockFactory blockFactory);
Contributor

++

close.add(p::releaseBlocks);
}
Collections.addAll(close, builders);
Releasables.closeExpectNoException(Releasables.wrap(close));
Contributor

This turns out to be not so bad. 👍
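
The pattern in the snippet above (collect everything that needs closing into one list, then close it all in one call) can be sketched with plain `AutoCloseable`. `CloseAllSketch.closeAll` is a hypothetical helper written for this example; it is not `Releasables`, and its error handling (keep closing, rethrow the first failure with the rest suppressed) is an assumption about what is desirable rather than a description of the Elasticsearch helper.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of closing many resources in one pass. */
final class CloseAllSketch {
    /** Closes every resource even if some fail, then rethrows the first failure. */
    static void closeAll(List<? extends AutoCloseable> resources) {
        RuntimeException failure = null;
        for (AutoCloseable resource : resources) {
            try {
                if (resource != null) {
                    resource.close();
                }
            } catch (Exception e) {
                if (failure == null) {
                    failure = new RuntimeException("failed to close a resource", e);
                } else {
                    failure.addSuppressed(e);
                }
            }
        }
        if (failure != null) {
            throw failure;
        }
    }

    /** Closes three resources, the second of which throws, and returns how many were closed. */
    static int closedDespiteFailure() {
        int[] closed = {0};
        List<AutoCloseable> resources = new ArrayList<>();
        resources.add(() -> closed[0]++);
        resources.add(() -> {
            closed[0]++;
            throw new IllegalStateException("simulated close failure");
        });
        resources.add(() -> closed[0]++);
        try {
            closeAll(resources);
        } catch (RuntimeException e) {
            // Expected in this sketch; every resource was still visited.
        }
        return closed[0];
    }
}
```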


// Note the lack of try/finally here - we're asserting that when the driver throws an exception we clear the breakers.
assertThat(bigArrays.breakerService().getBreaker(CircuitBreaker.REQUEST).getUsed(), equalTo(0L));
assertThat(inputFactoryContext.breaker().getUsed(), equalTo(0L));
Contributor

++

@After
public void allBreakersEmpty() {
for (CircuitBreaker breaker : breakers) {
assertThat(breaker.getUsed(), equalTo(0L));
Contributor

This is going to be useful. 👍

Member Author

Yeah. I think we should drag this into AnyOperatorTests - maybe pretty soon.

@nik9000
Member Author

nik9000 commented Sep 27, 2023

The part two errors look real! I'll take a look.

@nik9000
Member Author

nik9000 commented Sep 27, 2023

run elasticsearch-ci/part-3

@nik9000 nik9000 added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 27, 2023
@elasticsearchmachine elasticsearchmachine merged commit b3b0733 into elastic:esql/tracking Sep 27, 2023
@nik9000 nik9000 deleted the topn_errors branch September 27, 2023 14:14
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Sep 27, 2023
This adds things like `IntVector.FixedBuilder` which is slightly simpler
to use than constructing the arrays by hand. It also measures bytes used
up front in the circuit breaker. And it'll be easier to integrate it
into the framework happening over in elastic#99931 to handle errors in topn.

This also uses it in `mv_` functions.
elasticsearchmachine pushed a commit that referenced this pull request Sep 27, 2023
piergm pushed a commit to piergm/elasticsearch that referenced this pull request Oct 2, 2023
Labels
:Analytics/ES|QL AKA ESQL auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >non-issue Team:QL (Deprecated) Meta label for query languages team v8.11.0

3 participants